Bulk RNA seq-analysis of samples coming from brain, heart, and kidney tissues. For each of the three tissues, three samples were selected. Bulk RNA-seq analysis will be carried out using Bioconductor package egdeR for differential expression analysis of RNA-seq expression profiles.
brain<-get(load("rse_gene_brain_9_scaled.Rdata"))
brain
heart<-get(load("rse_gene_heart_6_scaled.Rdata"))
heart
kidney<-get(load("rse_gene_kidney_7_scaled.Rdata"))
data_brain<-brain[,c(2,3,4)]
data_heart<-heart[,c(10,1,2)]
data_kidney<-kidney[,c(7,8,9)]
save(data_brain,file="brain.RData")
save(data_heart,file= "heart.RData")
save(data_kidney,file="kidney.RData")load("brain.RData")
load("heart.RData")
load("kidney.RData")
data.frame("Brain"=colnames(data_brain),"Heart"=colnames(data_heart),"Kidney"=colnames(data_kidney))%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| Brain | Heart | Kidney |
|---|---|---|
| SRR2167642 | SRR1387702 | SRR1469746 |
| SRR607445 | SRR606939 | SRR1071807 |
| SRR663753 | SRR1468613 | SRR1486080 |
Information related to each single sample have been retrieved by the looking up samples in the GTEx portal as well as by recovering metadata stored in the RangedSummarizedExperiment as shown in the recount2 vignette, making use of recount Bioconductor package. Unfortunately many NAs are present in the metadata.
## sample ID names
info<-function(data){
data<-add_predictions(data)
info<-cbind(colnames(data),data$sampid,data$smtsd,data$smtstptref,as.vector(data$reported_sex))
colnames(info)<-c("Sample","Sample GTEX ID","Region","Death","Sex")
info %>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)
}
info_brain<-info(data_brain)
info_heart<-info(data_heart)
info_kidney<-info(data_kidney)| Sample | Sample GTEX ID | Region | Death | Sex |
|---|---|---|---|---|
| SRR2167642 | GTEX-T6MN-0011-R11A-SM-5S2TH | Brain - Cerebellar Hemisphere | Actual Death | male |
| SRR607445 | GTEX-OHPN-0011-R4A-SM-2I5FD | Brain - Amygdala | Actual Death | female |
| SRR663753 | GTEX-PVOW-2526-SM-2XCF7 | Brain - Cortex | Actual Death | male |
| Sample | Sample GTEX ID | Region | Death | Sex |
|---|---|---|---|---|
| SRR1387702 | GTEX-PWCY-0526-SM-5PNU8 | Heart - Left Ventricle | Actual Death | female |
| SRR606939 | GTEX-POMQ-0326-SM-2I5FO | Heart - Left Ventricle | Actual Death | female |
| SRR1468613 | GTEX-Y5V6-0826-SM-4VBRU | Heart - Left Ventricle | Actual Death | male |
| Sample | Sample GTEX ID | Region | Death | Sex |
|---|---|---|---|---|
| SRR1469746 | GTEX-11GS4-2326-SM-5A5KS | Kidney - Cortex | Actual Death | male |
| SRR1071807 | GTEX-T5JC-1526-SM-4DM68 | Kidney - Cortex | Actual Death | male |
| SRR1486080 | GTEX-1399S-0526-SM-5IJG8 | Kidney - Cortex | Actual Death | female |
Count data were pre-processed in order to remove short genes ( <200 bp) and mitochondrial genes. Genes which are both short and mitochrondrial are considered as short.
X<-SummarizedExperiment()
pre_processing<-function(X){
data<-as.data.frame(X@rowRanges)
read_counts<-as.data.frame(assay(X))
tot_reads<-colSums(read_counts)
read_count_short_genes<-read_counts[data$gene_id[data$bp_length < 200],]
tot_short_genes<-as.numeric(colSums(read_count_short_genes))
data<-data[data$bp_length>=200,]
read_counts<-read_counts[rownames(data),]
MT_genes<-rownames(data[data$seqnames =="chrM",])
read_count_MT_genes<-read_counts[MT_genes,]
tot_MT_genes<-as.numeric(colSums(read_count_MT_genes))
read_counts<-read_counts[data$gene_id %notin% MT_genes ,]
return(list(tot_reads,tot_short_genes,tot_MT_genes,read_counts))
}
brain<-pre_processing(data_brain)
brain_counts<-brain[[4]]
heart<-pre_processing(data_heart)
heart_counts<-heart[[4]]
kidney<-pre_processing(data_kidney)
kidney_counts<-kidney[[4]]
counts_table<-cbind(brain_counts,heart_counts,kidney_counts)Summary table displaying percentages of short genes and mitochondrial genes for each replicate. If a gene is both “short” and on the MT DNA, it only counts towards “short” genes total.
table<-data.frame(row.names = colnames(counts_table))
table$tot_reads<-c(brain[[1]],heart[[1]],kidney[[1]])
table$tot_short_reads<-c(brain[[2]],heart[[2]],kidney[[2]])
table$percentage_short_reads<-(table$tot_short_reads /table$tot_reads) *100
table$tot_MT_reads<-c(brain[[3]],heart[[3]],kidney[[3]])
table$percentage_MT_reads<-(table$tot_MT_reads /table$tot_reads) *100 | Tot_Reads | Tot_Short | % Tot_Short | Tot_MT | % Tot_MT | Tissue | |
|---|---|---|---|---|---|---|
| SRR2167642 | 32602806 | 34505 | 0.1058344 | 1810929 | 5.554519 | Brain |
| SRR607445 | 35456664 | 207924 | 0.5864173 | 18885940 | 53.264853 | Brain |
| SRR663753 | 37607691 | 76773 | 0.2041418 | 7726701 | 20.545534 | Brain |
| SRR1387702 | 39505018 | 91175 | 0.2307935 | 10113998 | 25.601806 | Heart |
| SRR606939 | 38788919 | 105861 | 0.2729156 | 12467758 | 32.142577 | Heart |
| SRR1468613 | 39229440 | 121991 | 0.3109680 | 14936858 | 38.075634 | Heart |
| SRR1469746 | 37448629 | 81223 | 0.2168918 | 15420307 | 41.177227 | Kidney |
| SRR1071807 | 37288595 | 83581 | 0.2241463 | 15028849 | 40.304144 | Kidney |
| SRR1486080 | 37979700 | 119836 | 0.3155265 | 5634358 | 14.835183 | Kidney |
DGEList objectBefore moving on with the subsequent analysis, the table obtained above, containg the counts for each gene across the 9 samples, must be converted in a DGEList object. Samples are also labeled according to the tissue they come from (Brain, Heart, Kidney).
bulk_table<-DGEList(counts=counts_table)
# define a tissue variable specifying the tissue each replicates belong to
tissue <- as.factor(c("Brain", "Brain", "Brain", "Heart", "Heart", "Heart", "Kidney", "Kidney", "Kidney"))
bulk_table$samples$group <- tissue
head(bulk_table) An object of class "DGEList"
$counts
SRR2167642 SRR607445 SRR663753 SRR1387702 SRR606939
ENSG00000000003.14 126 302 218 148 182
ENSG00000000005.5 1 3 8 1 4
ENSG00000000419.12 1061 229 792 715 561
ENSG00000000457.13 772 196 466 233 293
ENSG00000000460.16 603 102 214 99 137
ENSG00000000938.12 51 250 172 168 127
SRR1468613 SRR1469746 SRR1071807 SRR1486080
ENSG00000000003.14 179 1161 619 1095
ENSG00000000005.5 5 4 6 45
ENSG00000000419.12 574 579 589 685
ENSG00000000457.13 263 355 333 723
ENSG00000000460.16 124 172 159 357
ENSG00000000938.12 281 325 124 1368
$samples
group lib.size norm.factors
SRR2167642 Brain 30757372 1
SRR607445 Brain 16362800 1
SRR663753 Brain 29804217 1
SRR1387702 Heart 29299845 1
SRR606939 Heart 26215300 1
SRR1468613 Heart 24170591 1
SRR1469746 Kidney 21947099 1
SRR1071807 Kidney 22176165 1
SRR1486080 Kidney 32225506 1
Given the large amount of genes, it is recommendable to filter out all genes that have zero or really low counts. Also, gene counts will be log-normalized following this step. Count values smaller than 1 will give negative values which may affect gene comparisons. egdeR performs the filtering taking into account also the library size of each sample.
zero_count<-table(rowSums(bulk_table$counts==0)==9) # 7,114 genes
genes_to_keep <- filterByExpr(bulk_table,group = tissue)
# Determine which genes have sufficiently large counts to be retained in a statistical analysis.
bulk_table<- bulk_table[genes_to_keep,, keep.lib.sizes=FALSE]A total of 7,114 genes have zero count. After filtering, only 24,446 were retained for subsequent analysis.
Counts are normalized according to the egdeR’s default method TMM. TMM normalization is a simple yet robust way to estimate the ratio of RNA production and uses a weighted trimmed mean of the log expression ratios.
An object of class "DGEList"
$counts
SRR2167642 SRR607445 SRR663753 SRR1387702 SRR606939
ENSG00000000003.14 126 302 218 148 182
ENSG00000000419.12 1061 229 792 715 561
ENSG00000000457.13 772 196 466 233 293
ENSG00000000460.16 603 102 214 99 137
ENSG00000000938.12 51 250 172 168 127
SRR1468613 SRR1469746 SRR1071807 SRR1486080
ENSG00000000003.14 179 1161 619 1095
ENSG00000000419.12 574 579 589 685
ENSG00000000457.13 263 355 333 723
ENSG00000000460.16 124 172 159 357
ENSG00000000938.12 281 325 124 1368
24441 more rows ...
$samples
group lib.size norm.factors
SRR2167642 Brain 30654722 1.1005668
SRR607445 Brain 16317541 1.0477393
SRR663753 Brain 29702207 1.0970365
SRR1387702 Heart 29281810 0.7231013
SRR606939 Heart 26189859 0.7597152
SRR1468613 Heart 24151973 0.8002918
SRR1469746 Kidney 21908829 1.2789012
SRR1071807 Kidney 22137700 1.1036751
SRR1486080 Kidney 32057855 1.2738922
The logarithm of the normalized counts was also taken. Boxplots below display the distribution of the normalized-log counts for each sample.
logcpm <- cpm(bulk_table, log=TRUE)
par(mar = c(3, 3, 3, 7), xpd = TRUE)
boxplot(logcpm,outline=FALSE,col=rep(c("cornflowerblue","coral2","darkorange2"),each=3), id=colnames(counts_table))
legend("right",inset = c(- 0.11, 0),title ="Tissue", legend=c("Brain","Heart","Kidney"),fill=c("cornflowerblue","coral2","darkorange2"))We built a design matrix with no intercept. There is no base condition the tissues can be compared with since we have tissues coming from completely different areas of the body.
# Building the design matrix
# no intercept
design_matrix <- model.matrix(~0+group, data=bulk_table$samples)
colnames(design_matrix) <- levels(bulk_table$samples$group)
design_matrix Brain Heart Kidney
SRR2167642 1 0 0
SRR607445 1 0 0
SRR663753 1 0 0
SRR1387702 0 1 0
SRR606939 0 1 0
SRR1468613 0 1 0
SRR1469746 0 0 1
SRR1071807 0 0 1
SRR1486080 0 0 1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
Samples are plotted in a new two-dimensinal space in order to check whether samples coming from the tissue cluster well together and are far apart from clusters of the other tissues. The new space is defined using dimensionality reduction techniques in order to find components along which the data are more separated, i.e. components having the largest variance. For bulk RNA-seq, Principal Component Analysis (PCA) and Multi Dimensional Scaling (MDS) are mostly used for visualization purposes.
MDS is a non linear transformation which attempts to preserve relative distances. The log2 fold change(log2FC) values among the samples are employed instead of expression counts. Choose option plotMDS-Replicate to see MDS projection with replicates being labeled with corresponding sample name.
Samples belonging to the same tissues seem to be clustering well together. They identify 3 separate clusters, each representing a different tissue.
par(mar = c(3, 3, 3, 7), xpd = TRUE)
plotMDS(logcpm, labels=c("SRR2167642-B", "SRR607445-B" ,"SRR663753-B" , "SRR1387702-H", "SRR606939-H", "SRR1468613-H" ,"SRR1469746-K" ,"SRR1071807-K", "SRR1486080-K"),col=rep(c("cornflowerblue","coral2","darkorange2"),each=3))
legend("right",inset = c(- 0.11, 0),title ="Tissue", legend=c("Brain","Heart","Kidney"),fill=c("cornflowerblue","coral2","darkorange2"))PCA attempts to find a linear transformation of the data that produces combinations of variables with maximum variation between samples while preserving relative distances. PCA is a special case of MDS.
Samples belonging to the same tissues seem to be clustering well together. They identify 3 separate clusters, each representing a different tissue.
Counts are assumed to have a Negative Binomial distribution which allows to account for both technical and biological varaibility. Dispersion parameter, which summarizes the biological variability, must be estimated directly from the data since it directly depends on the sample being analysed. Genewise biological coefficient of variation is plotted against the average log CPM (gene abundance). EgdeRalso estimates the common dispersion for all genes (red curve).
bulk_table <- estimateDisp(bulk_table, design_matrix)
# Plot the genewise biological coefficient of variation (BCV) against gene abundance (in log2 counts per million).
plotBCV(bulk_table)[1] 0.328325
The estimated common dispersion parameter has a value of 0.328.
To find the genes that are differently expressed in the different tissue we need to fit a generalized linear model using the design matrix built above. A generalized linear model is fitted beacuse the gene counts have a negative binomial distribution, whereas, in order to be able to fit a traditional linear model, the outcome of interest (counts) must be normally distributed.
In ordert to test differentially expressed genes, glmQLFTest. Contrasts must be specified. Tissue having contrast equal to 1 is the tissue for which up-regulated genes and down regulated genes are found related to other tissue, contrast of -1. Since multiple tests are carried out simultaneously, the p-value for each test should be adjusted for multiple correction (False Discovery Rate,FDR). The default adjustment method used by egdeR is the Benjamini & Hochberg correction. Different thresholds for the significance level were considered so as to find the more statistically significant genes, i.e the ones for which is less likely that genes being tested come from the same distribution. Genes are considered differentially expressed if the adjusted p-value (FDR) is less than the significance level. A threshold on log2 fold-change (absolute value of log2FC > 1) was also added to make selection more stringent after lowering the significance level (stricter threshold). Log2FC is defined as the log-ratio of TMM values for a pair of expressed genes. A significance level of \(\alpha\) = 0.01 was chosen. Adding a threshold on log2FC did not make any difference on the total amount of DE genes found.
Genes differentially expressed in Heart tissue with respect to Brain tissue.
qlf_HvsB<-glmQLFTest(model_glm, contrast=c(-1,1,0))
FDR_HvsB <- p.adjust(qlf_HvsB$table$PValue, method="BH")
tot_genes.0.5<-sum(FDR_HvsB< 0.05)
#summary(decideTests(qlf_HvsB)) # has as default FDR (BH adjusted p-value)
#summary(decideTests(qlf_HvsB, p.value=0.01))
summary(decideTests(qlf_HvsB, p.value=0.01,lfc=1)) -1*Brain 1*Heart
Down 2056
NotSig 20915
Up 1475
df_HvsB <- topTags(qlf_HvsB, n=20000, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
head(df_HvsB)%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|
| ENSG00000168280.16 | -9.037983 | 8.605205 | 372.0951 | 0e+00 | 0.0001692 |
| ENSG00000174521.7 | -8.092650 | 4.111864 | 325.5484 | 0e+00 | 0.0001692 |
| ENSG00000267257.1 | 9.719105 | 4.687175 | 271.1180 | 0e+00 | 0.0001806 |
| ENSG00000168243.10 | -7.641706 | 4.206588 | 268.7566 | 0e+00 | 0.0001806 |
| ENSG00000110076.18 | -7.141565 | 6.110606 | 259.4211 | 0e+00 | 0.0001806 |
| ENSG00000198796.6 | 9.689517 | 6.603624 | 242.2253 | 1e-07 | 0.0001806 |
Genes differentially expressed in Kidney tissue with respect to Brain tissue.
qlf_KvsB<-glmQLFTest(model_glm, contrast=c(-1,0,1))
FDR_KvsB <- p.adjust(qlf_KvsB$table$PValue, method="BH")
tot_genes.0.5<-sum(FDR_KvsB< 0.05)
#summary(decideTests(qlf_KvsB))
#summary(decideTests(qlf_KvsB, p.value=0.01))
summary(decideTests(qlf_KvsB, p.value=0.01,lfc=1)) -1*Brain 1*Kidney
Down 1614
NotSig 21669
Up 1163
df_KvsB <- topTags(qlf_KvsB, n=20000, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
head(df_KvsB)%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|
| ENSG00000265168.1 | -5.660853 | 5.975233 | 308.1608 | 0e+00 | 0.0002183 |
| ENSG00000168280.16 | -7.751925 | 8.605205 | 306.2097 | 0e+00 | 0.0002183 |
| ENSG00000166589.12 | 13.427811 | 7.895476 | 280.1667 | 0e+00 | 0.0002183 |
| ENSG00000174521.7 | -6.679155 | 4.111864 | 264.2833 | 0e+00 | 0.0002183 |
| ENSG00000196104.10 | -9.053960 | 4.820608 | 230.1301 | 1e-07 | 0.0002534 |
| ENSG00000162728.4 | -10.381074 | 6.139604 | 221.1241 | 1e-07 | 0.0002534 |
Genes differentially expressed in Kidney tissue with respect to Heart tissue.
qlf_KvsH<-glmQLFTest(model_glm, contrast=c(0,-1,1))
FDR_KvsH <- p.adjust(qlf_KvsH$table$PValue, method="BH")
tot_genes.0.5<-sum(FDR_KvsH< 0.05)
#summary(decideTests(qlf_KvsH))
#summary(decideTests(qlf_KvsH,p.value = 0.01))
summary(decideTests(qlf_KvsH,p.value = 0.01,lfc=1)) -1*Heart 1*Kidney
Down 954
NotSig 22487
Up 1005
df_KvsH<- topTags(qlf_KvsH, n=20000, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
head(df_KvsH)%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|
| ENSG00000186439.12 | -10.016047 | 6.725510 | 288.3992 | 0e+00 | 0.0001996 |
| ENSG00000265168.1 | -5.374136 | 5.975233 | 285.1522 | 0e+00 | 0.0001996 |
| ENSG00000214097.4 | -10.847967 | 4.524362 | 284.1654 | 0e+00 | 0.0001996 |
| ENSG00000166589.12 | 11.570320 | 7.895476 | 244.5069 | 1e-07 | 0.0001996 |
| ENSG00000144035.3 | 11.442332 | 5.137774 | 239.0738 | 1e-07 | 0.0001996 |
| ENSG00000113396.12 | -9.941702 | 3.569841 | 228.8086 | 1e-07 | 0.0001996 |
After identifying the differentially expressed genes, both up-regulated and down-regulated, in one tissue with respect to the other, all genes up-regulated (same for down_regulated genes) in one tissue with respect to both (i.e genes up-regulated in brain both when compared to heart and kidney tissues) were retrieved.
# finding DE genes in Brain wrt Heart
up_genes_BvsH<-down_genes_HvsB
down_genes_BvsH<-up_genes_HvsB
# finding DE genes in Brain wrt Kidney
up_genes_BvsK<-down_genes_KvsB
down_genes_BvsK<-up_genes_KvsB
# finding DE genes in Heart wrt Kidney
up_genes_HvsK <- down_genes_KvsH
down_genes_HvsK<-up_genes_KvsH
# up-regulated genes in Brain wrt to other tissues
up_genes_brain<-up_genes_BvsH[up_genes_BvsH %in% up_genes_BvsK]
# down-regulated genes in Brain wrt to other tissues
down_genes_brain<-down_genes_BvsH[down_genes_BvsH %in% down_genes_BvsK]
# up-regulated genes in Heart wrt to other tissues
up_genes_heart<-up_genes_HvsB[up_genes_HvsB %in% up_genes_HvsK]
# down-regulated genes in Heart wrt to other tissues
down_genes_heart<-down_genes_HvsB[down_genes_HvsB %in% down_genes_HvsK]
# up-regulated genes in Kidney wrt to other tissues
up_genes_kidney<-up_genes_KvsB[up_genes_KvsB %in% up_genes_KvsH]
# down-regulated genes in Kidney wrt to other tissues
down_genes_kidney<-down_genes_KvsB[down_genes_KvsB %in% down_genes_KvsH]df_HvsB$DE_Heart<-ifelse(df_HvsB$logFC>1,"UP-reg","DOWN-reg")
df_HvsB$DE_Brain<-ifelse(df_HvsB$logFC>1,"DOWN-reg","UP-reg")
write.csv(df_HvsB[,c(7,6)],"genes_DE_HvsB.csv")
df_KvsB$DE_Kidney<-ifelse(df_KvsB$logFC>1,"UP-reg","DOWN-reg")
df_KvsB$DE_Brain<-ifelse(df_KvsB$logFC>1,"DOWN-reg","UP-reg")
write.csv(df_KvsB[,c(7,6)],"genes_DE_KvsB.csv")
df_KvsH$DE_Kidney<-ifelse(df_KvsH$logFC>1,"UP-reg","DOWN-reg")
df_KvsH$DE_Heart<-ifelse(df_KvsH$logFC>1,"DOWN-reg","UP-reg")
write.csv(df_KvsH[,c(7,6)],"genes_DE_KvsH.csv")
write.csv(up_genes_brain,"genes_up_brain.csv")
write.csv(down_genes_brain,"genes_down_brain.csv")
write.csv(up_genes_heart,"genes_up_heart.csv")
write.csv(down_genes_heart,"genes_down_heart.csv")
write.csv(up_genes_kidney,"genes_up_kidney.csv")
write.csv(down_genes_kidney,"genes_down_kidney.csv")Gene ID have the correspective version after the dot, ensemble gene ID do not, hence we need to remove the part reffering to the version. The list of genes up-regulated in each tissue was exported as cvs file and used for subsequent functional enrichment analysis.
Annotation functional enrichhment analysis
library("org.Hs.eg.db")
annotation<-function(genes){
ensembl_gene_ID<- gsub("\\..*", "", genes)
genes_ens <- as.data.frame(mapIds(org.Hs.eg.db,keys=ensembl_gene_ID, keytype="ENSEMBL", column="SYMBOL", multiVals="first"))
genes_ens_geneID<-rownames(genes_ens)
genes_ens_labels<-genes_ens$`mapIds(org.Hs.eg.db, keys = ensembl_gene_ID, keytype = "ENSEMBL", column = "SYMBOL", multiVals = "first")`
# gene IDs
write.table(genes_ens_geneID,paste0(deparse(substitute(genes)),".txt"),col.names = F,row.names = F,quote=F)
# gene symbols
write.table(genes_ens_labels[is.na(genes_ens_labels)==F],paste0(deparse(substitute(genes)),"_labels.txt"),col.names = F,row.names = F,quote=F)
}
annotation(up_genes_brain)
annotation(up_genes_heart)
annotation(up_genes_kidney)Find the gene up-regulated in Brain tissue vs Heart tissue with the lowest FDR and which is also up-regulated also in Brain tissue vs Kidney tissue.
#head(rownames(df_HvsB[df_HvsB$logFC < -1,]
#head(rownames(df_KvsB[df_KvsB$logFC < -1 ,]))
min_UP_brain<-rownames(df_HvsB[which.min(df_HvsB[df_HvsB$logFC < -1,]$FDR),])
min_UP_brain %in% rownames(df_KvsB[df_KvsB$logFC < -1 ,])[1] TRUE
[1] "ENSG00000168280.16"
Among all different Bioconductor packages that allow you to carry out differential expression analysis, edgeR is the least conservative, i.e. more powerful, thereby more likely to find a statistically significant result. Limma (Linear Models for MicroArrayis) is another package for analysis of gene expression data stemming from analysis of microarrays. In limma, counts (given enough replicates are present) is assumed to be normally distributed, hence only mean and variance must be estimated, no dispersion parameter is needed. Since the distribution of the counts is approximately normal, a traditional linear model can be fitted to the data and a simple t-test can be employed to test whether genes are differentially expressed across the two tissues being compared. Limma bit more conservative than edgeR, so we should obtain a lower number of differentially expressed genes.
Like edgeR, limma uses log-normalized counts per million (TMM). The authors of limma are the same as edgeR, so the packages share some of the steps in the analysis process. They only differ in the model fitted to the gene expression data.
Using the same DGEList object created above for analysis with edgeR without including the part involving the estimation of dispersion parameter.
Building the desing matrix.
design <- model.matrix(~0+group, data=bulk_table_limma$samples)
colnames(design) <- gsub("group", "", colnames(design))
design Brain Heart Kidney
SRR2167642 1 0 0
SRR607445 1 0 0
SRR663753 1 0 0
SRR1387702 0 1 0
SRR606939 0 1 0
SRR1468613 0 1 0
SRR1469746 0 0 1
SRR1071807 0 0 1
SRR1486080 0 0 1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
Building the constrast matrix.
contr.matrix <- makeContrasts(
HeartvsBrain = Heart- Brain,
KidneyvsBrain = Kidney-Brain,
KidneyvsHeart = Kidney-Heart,
levels = colnames(design))
contr.matrix %>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| HeartvsBrain | KidneyvsBrain | KidneyvsHeart | |
|---|---|---|---|
| Brain | -1 | -1 | 0 |
| Heart | 1 | 0 | -1 |
| Kidney | 0 | 1 | 1 |
voom transformation is applied to the normalized and filtered DGEList object.
To find the genes that are differentially expressed in limma we use a traditional linear model as already discussed above.
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
plotSA(efit)Summary results for all tests performed.
summary(decideTests(efit))%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| HeartvsBrain | KidneyvsBrain | KidneyvsHeart | |
|---|---|---|---|
| Down | 3382 | 3357 | 2373 |
| NotSig | 17518 | 17523 | 19952 |
| Up | 3546 | 3566 | 2121 |
Same threshold parameters of the egdeR analysis are used for the selection of signficant differentially expressed genes. Significance level of \(\alpha\) = 0.01 and absolute value of log2FC > 1.
Genes differentially expressed in Heart tissue with respect to Brain tissue.
HvsB_l<- topTreat(efit, coef=1, n=Inf,adjust.method = "BH", sort.by = "p", p.value = 0.01)
up_genes_HvsB_l<- row.names(HvsB_l[HvsB_l$logFC > 1,])
down_genes_HvsB_l <- row.names(HvsB_l[HvsB_l$logFC < -1,])
#nrow(HvsB_l)
head(HvsB_l)%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|
| ENSG00000168280.16 | -9.044419 | 4.574942 | -25.16726 | 0 | 1.57e-05 | 11.08187 |
| ENSG00000173991.5 | 9.484989 | 5.475126 | 22.82238 | 0 | 1.94e-05 | 11.19243 |
| ENSG00000237298.9 | 7.840217 | 6.423091 | 19.87601 | 0 | 2.90e-05 | 10.68048 |
| ENSG00000121577.13 | 8.397964 | 4.414277 | 19.70534 | 0 | 2.90e-05 | 10.01234 |
| ENSG00000122367.19 | 7.826212 | 5.796136 | 19.61430 | 0 | 2.90e-05 | 10.58620 |
| ENSG00000100170.9 | 10.085167 | 2.199401 | 19.11611 | 0 | 2.90e-05 | 9.20982 |
Genes differentially expressed in Kidney tissue with respect to Brain tissue.
KvsB_l <- topTreat(efit, coef=2, n=Inf,adjust.method = "BH", sort.by = "p", p.value = 0.01)
up_genes_KvsB_l<- row.names(KvsB_l[KvsB_l$logFC > 1,])
down_genes_KvsB_l <- row.names(KvsB_l[KvsB_l$logFC < -1,])
#nrow(KvsB_l)
head(KvsB_l)%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|
| ENSG00000168280.16 | -7.740418 | 4.5749420 | -25.41409 | 0 | 1.43e-05 | 11.883900 |
| ENSG00000166589.12 | 13.127350 | 1.1755814 | 17.72422 | 0 | 5.00e-05 | 8.290850 |
| ENSG00000242366.3 | 13.126162 | -0.6949378 | 17.56966 | 0 | 5.00e-05 | 8.163250 |
| ENSG00000243135.6 | 13.124438 | -0.6955107 | 17.55710 | 0 | 5.00e-05 | 8.159920 |
| ENSG00000240224.1 | 13.124704 | -0.6954226 | 17.55442 | 0 | 5.00e-05 | 8.159204 |
| ENSG00000244474.5 | 13.134763 | -0.6920660 | 17.55356 | 0 | 5.00e-05 | 8.158920 |
Genes differentially expressed in Kidney tissue with respect to Heart tissue.
KvsH_l<- topTreat(efit, coef=3, n=Inf,adjust.method = "BH", sort.by = "p", p.value = 0.01)
up_genes_KvsH_l<- row.names(KvsH_l[KvsH_l$logFC > 1,])
down_genes_KvsH_l <- row.names(KvsH_l[KvsH_l$logFC < -1,])
#row(KvsH_l)
head(KvsH_l)%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|
| ENSG00000173991.5 | -10.284285 | 5.475126 | -23.34910 | 0 | 2.75e-05 | 11.108162 |
| ENSG00000186439.12 | -10.022586 | 2.436215 | -20.89037 | 0 | 2.75e-05 | 9.613491 |
| ENSG00000122367.19 | -9.411969 | 5.796136 | -20.43170 | 0 | 2.75e-05 | 10.476033 |
| ENSG00000237298.9 | -7.754575 | 6.423091 | -19.99943 | 0 | 2.75e-05 | 10.765637 |
| ENSG00000155657.25 | -8.873335 | 7.927762 | -19.56746 | 0 | 2.75e-05 | 10.682676 |
| ENSG00000283186.1 | -8.567486 | 7.690718 | -19.41187 | 0 | 2.75e-05 | 10.619962 |
After identifying the differentially expressed genes, both up-regulated and down-regulated, in one tissue with respect to the other, all genes up-regulated (same for down_regulated genes) in one tissue with respect to both (i.e genes up-regulated in brain both when compared to heart and kidney tissues) were retrieved.
# finding DE genes in Brain wrt Heart
up_genes_BvsH_l<-down_genes_HvsB_l
down_genes_BvsH_l<-up_genes_HvsB_l
# finding DE genes in Brain wrt Kidney
up_genes_BvsK_l<-down_genes_KvsB_l
down_genes_BvsK_l<-up_genes_KvsB_l
# finding DE genes in Heart wrt Kidney
up_genes_HvsK_l <- down_genes_KvsH_l
down_genes_HvsK_l<-up_genes_KvsH_l
# up-regulated genes in Brain wrt to other tissues
up_genes_brain_limma<-up_genes_BvsH_l[up_genes_BvsH_l %in% up_genes_BvsK_l]
# down-regulated genes in Brain wrt to other tissues
down_genes_brain_limma<-down_genes_BvsH_l[down_genes_BvsH_l %in% down_genes_BvsK_l]
# up-regulated genes in Heart wrt to other tissues
up_genes_heart_limma<-up_genes_HvsB_l[up_genes_HvsB_l %in% up_genes_HvsK_l]
# down-regulated genes in Heart wrt to other tissues
down_genes_heart_limma<-down_genes_HvsB_l[down_genes_HvsB_l %in% down_genes_HvsK_l]
# up-regulated genes in Kidney wrt to other tissues
up_genes_kidney_limma<-up_genes_KvsB_l[up_genes_KvsB_l %in% up_genes_KvsH_l]
# down-regulated genes in Kidney wrt to other tissues
down_genes_kidney_limma<-down_genes_KvsB_l[down_genes_KvsB_l %in% down_genes_KvsH_l]comparison<-cbind(c("Brain","Heart","Kidney"),c(length(up_genes_brain),length(up_genes_heart),length(up_genes_kidney)),c(length(up_genes_brain_limma),length(up_genes_heart_limma),length(up_genes_kidney_limma)),c(length(down_genes_brain),length(down_genes_heart),length(down_genes_kidney)),c(length(down_genes_brain_limma),length(down_genes_heart_limma),length(down_genes_kidney_limma)),c(sum(up_genes_brain %in% up_genes_brain_limma),sum(up_genes_heart %in%up_genes_heart_limma),sum(up_genes_kidney %in% up_genes_kidney_limma)),c(sum(down_genes_brain %in% down_genes_brain_limma),sum(down_genes_heart %in% down_genes_heart_limma),sum(down_genes_kidney %in% down_genes_kidney_limma)))
colnames(comparison)<-c("Tissue","UP-edgeR","UP-limma","DOWN-edgeR","DOWN-limma","UP-both","DOWN-both")
comparison%>%
kable(caption = '',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| Tissue | UP-edgeR | UP-limma | DOWN-edgeR | DOWN-limma | UP-both | DOWN-both |
|---|---|---|---|---|---|---|
| Brain | 1120 | 1060 | 278 | 411 | 935 | 245 |
| Heart | 585 | 778 | 271 | 138 | 580 | 130 |
| Kidney | 512 | 629 | 169 | 147 | 464 | 126 |
Checking if the same gene returned by analysis with limma is the same returned in above analysis.
min_UP_brain_limma<-rownames(HvsB_l[which.min(HvsB_l[HvsB_l$logFC < -1,]$adj.P.Val),])
min_UP_brain_limma %in% rownames(KvsB_l[KvsB_l$logFC < -1 ,])[1] TRUE
[1] "ENSG00000168280.16"
Indeed, the same gene, ENSG00000168280.16, is returned.
annotation<-function(genes){
ensembl_gene_ID<- gsub("\\..*", "", genes)
genes_ens <- as.data.frame(mapIds(org.Hs.eg.db,keys=ensembl_gene_ID, keytype="ENSEMBL", column="SYMBOL", multiVals="first"))
genes_ens_geneID<-rownames(genes_ens)
genes_ens_labels<-genes_ens$`mapIds(org.Hs.eg.db, keys = ensembl_gene_ID, keytype = "ENSEMBL", column = "SYMBOL", multiVals = "first")`
# gene IDs
write.table(genes_ens_geneID,paste0(deparse(substitute(genes)),".txt"),col.names = F,row.names = F,quote=F)
# gene symbols
write.table(genes_ens_labels[is.na(genes_ens_labels)==F],paste0(deparse(substitute(genes)),"_labels.txt"),col.names = F,row.names = F,quote=F)
}
annotation(up_genes_brain_limma)
annotation(up_genes_heart_limma)
annotation(up_genes_kidney_limma)